Asymmetric Valleys: Beyond Sharp and Flat Local Minima

He, Haowei, Huang, Gao, Yuan, Yang

Neural Information Processing Systems

Despite the non-convex nature of their loss functions, deep neural networks are known to generalize well when optimized with stochastic gradient descent (SGD). Recent work conjectures that SGD with a proper configuration is able to find wide and flat local minima, which correlate with good generalization performance. In this paper, we observe that the local minima of modern deep networks are more than simply flat or sharp: at a local minimum there exist many asymmetric directions along which the loss increases abruptly on one side and slowly on the other; we formally define such minima as asymmetric valleys. Under mild assumptions, we first prove that for asymmetric valleys, a solution biased towards the flat side generalizes better than the exact empirical minimizer. Then, we show that performing weight averaging along the SGD trajectory implicitly induces such biased solutions. This provides a theoretical explanation for a series of intriguing phenomena observed in recent work [25, 5, 51]. Finally, extensive experiments on both modern deep networks and simple 2-layer networks validate our assumptions and analyze the properties of asymmetric valleys.
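To make the averaging claim concrete, below is a minimal, self-contained sketch (illustrative only, not code from the paper; the toy loss, constants, and variable names are all assumptions): a 1D loss whose minimum at w = 0 rises steeply on the left (sharp side) and gently on the right (flat side). Averaging noisy SGD iterates tends to land on the flat side of the empirical minimizer, which is the bias the paper argues improves generalization.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy asymmetric valley at w = 0: sharp quadratic on the left,
    # shallow quadratic on the right.
    def loss(w):
        return 10.0 * w ** 2 if w < 0 else 0.1 * w ** 2

    def grad(w):
        return 20.0 * w if w < 0 else 0.2 * w

    lr = 0.05          # step size
    noise_std = 1.0    # stand-in for minibatch gradient noise
    w = 1.0            # start on the flat side
    iterates = []

    for t in range(20000):
        g = grad(w) + rng.normal(scale=noise_std)   # noisy (stochastic) gradient
        w -= lr * g
        iterates.append(w)

    w_avg = float(np.mean(iterates[2000:]))         # tail average of the trajectory

    print(f"last SGD iterate: w = {w:+.3f}, loss = {loss(w):.4f}")
    print(f"averaged iterate: w = {w_avg:+.3f}, loss = {loss(w_avg):.4f}")
    # Gradient noise carries the iterates farther up the flat (right) side than
    # the sharp (left) side, so the trajectory average is typically > 0, i.e.
    # biased toward the flat side rather than sitting at the minimizer w = 0.

One plausible intuition for the generalization claim, stated informally: if the test loss behaves like a slightly shifted copy of the training loss, a point on the flat side pays only a small penalty under such a shift, whereas the exact empirical minimizer risks a steep increase coming from the sharp side.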


Reviews: Asymmetric Valleys: Beyond Sharp and Flat Local Minima

Neural Information Processing Systems

Summary: The authors analyse the energy landscape associated with the training of deep neural networks and introduce the concept of Asymmetric Valleys (AV): local minima that cannot be classified as either sharp or flat. AVs are characterized by the presence of asymmetric directions along which the loss increases abruptly on one side and is almost flat on the other. The presence of AVs in commonly used architectures is demonstrated empirically by showing that asymmetric directions can be found with decent probability. The authors explain why SGD with averaged updates behaves well, in terms of the generalization properties of the trained model, in the proximity of AVs. Strengths: The study of neural networks' energy landscape is an important and timely research topic.


